Keyword [Spatiotemporal Attention]
Li S, Bak S, Carr P, et al. Diversity regularized spatiotemporal attention for video-based person re-identification[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018: 369-378.
1. Overview
1.1. Motivation
- most existing methods encode each video frame in its entirety and compute an aggregate representation across all frames
- when a person is partially occluded, the remaining visible portions may still provide strong cues for re-identification
- features generated directly from entire images can easily miss such fine-grained visual cues
This paper proposes a spatiotemporal attention model:
- multiple spatial attention models (for alignment) + a diversity regularization term (Hellinger distance) so that different models do not attend to the same body region
- aligns corresponding image patches across frames
- determines whether a particular part of the body is occluded in a given frame
- temporal attention
- automatically discovers a diverse set of distinctive body parts
- extracts useful information from all frames without succumbing to occlusions and misalignment
1.2. Related Work
1.2.1. Image-Based Person Re-id
- extracting discriminative features
- learning robust metrics
- Online Instance Matching Loss
1.2.2. Video-Based Person Re-id
(extension of image-based)
- top-push distance
- RNN
- space-time
1.2.3. Attention Models for Person Re-id
- avoid different attention models focusing on the same region
2. Methods
2.1. Restricted Random Sampling
- divide the video into N chunks of equal duration
- randomly sample one image from each chunk
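A minimal sketch of this sampling scheme; the function name and chunk-boundary arithmetic are my own, the paper only specifies equal-duration chunks with one random frame each:

```python
import random

def restricted_random_sampling(num_frames, n_chunks=6, rng=random):
    """Divide frame indices [0, num_frames) into n_chunks chunks of
    (near-)equal duration and sample one random index from each chunk."""
    assert num_frames >= n_chunks, "need at least one frame per chunk"
    # Chunk i covers indices [bounds[i], bounds[i+1]).
    bounds = [i * num_frames // n_chunks for i in range(n_chunks + 1)]
    return [rng.randrange(bounds[i], bounds[i + 1]) for i in range(n_chunks)]
```

Because one frame is drawn per chunk, every training epoch sees a different N-frame subset while still covering the whole clip.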
2.2. Multiple Spatial Attention Models
Each spatial attention model focuses on one region: a body part, hat, bag, …
- ResNet-50 backbone (1 conv layer + 4 residual blocks), producing an 8x4 grid of features
- L = 32 (number of grid cells)
- D = 2048 (feature dimension)
- attention weight for the n-th frame, k-th attention model, l-th grid cell
- region feature = attention-weighted sum of the grid features
- enhancement details in the appendix
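A sketch of one forward pass of the spatial attention step, assuming a simple linear scoring function per attention model (the paper uses small learned layers; the parameterization here is a placeholder):

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def spatial_attention(feat, W_list):
    """feat: (L, D) grid features of one frame.
    W_list: K scoring vectors of shape (D,) -- hypothetical parameters,
    one per attention model.
    Returns (K, D) region features and (K, L) attention weights."""
    regions, attn = [], []
    for w in W_list:
        s = softmax(feat @ w)        # (L,) distribution over grid cells
        regions.append(s @ feat)     # (D,) weighted sum of grid features
        attn.append(s)
    return np.stack(regions), np.stack(attn)
```

Each attention model produces a distribution over the L grid cells, so its region feature is a convex combination of grid features rather than a hard crop.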
2.3. Diversity Regularization
- the attention weights of the n-th frame are collected into a K x L matrix (one row per attention model)
- K: number of attention models
- L: number of grid cells
Hellinger distance: maximize the distance between each pair of attention distributions
Regularization term: multiplied by a coefficient and added to the original OIM loss
- variant: the pairwise distances can be aggregated compactly as a Frobenius-norm penalty on the (element-wise square-rooted) attention matrix
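For discrete distributions p and q, the Hellinger distance is H(p, q) = (1/√2)·‖√p − √q‖₂, so maximizing it pushes attention models toward disjoint regions. A sketch of the distance and of a Frobenius-norm penalty that aggregates the pairwise terms (treat the exact penalty form as illustrative):

```python
import numpy as np

def hellinger(p, q):
    """Hellinger distance between two discrete distributions."""
    return np.linalg.norm(np.sqrt(p) - np.sqrt(q)) / np.sqrt(2)

def diversity_penalty(A):
    """A: (K, L) matrix whose rows are attention distributions (rows sum to 1).
    Pushing sqrt-rows toward orthonormality is equivalent to pushing the
    pairwise Hellinger distances toward their maximum."""
    R = np.sqrt(A)                            # element-wise square root
    K = A.shape[0]
    return np.linalg.norm(R @ R.T - np.eye(K)) ** 2
```

The penalty is 0 exactly when the K attention models occupy disjoint grid cells, and grows as their distributions overlap.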
2.4. Temporal Attention
Pooling features across time with a single per-frame weight is not sufficiently robust: a frame that is partly occluded may still contain valuable partial information about the individual, yet a per-frame weight applies the same temporal attention to every region of that frame.
- instead, a separate set of temporal weights across all frames is learned for each attention region
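A sketch of per-region temporal pooling, assuming unnormalized temporal scores are already computed per frame and per region (how those scores are produced is omitted here):

```python
import numpy as np

def temporal_attention(region_feats, scores):
    """region_feats: (N, K, D) per-frame features for K spatial regions.
    scores: (N, K) unnormalized temporal scores, one per frame AND region.
    Softmax is taken over the frame axis separately for each region,
    then each region is pooled across frames with its own weights."""
    w = np.exp(scores - scores.max(axis=0))
    w = w / w.sum(axis=0)                          # (N, K), columns sum to 1
    return np.einsum('nk,nkd->kd', w, region_feats)  # (K, D)
```

Because each region has its own weight column, a frame where the legs are occluded can still contribute its (visible) upper-body region with high weight.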
2.5. Overview
- the entire video is represented by a single vector x obtained by concatenating the K pooled region features (dimension K x D)
2.6. Re-id Loss
- OIM
3. Experiments
3.1. Details
- N = 6
- pretrain ResNet-50 on image-based re-identification datasets
- with the CNN fixed, train the multiple spatial attention models (with diversity regularization)
- then jointly train the whole network
- SGD, learning rate 0.1, dropped to 0.01
- final embedding: 128-dimensional, L2-normalized
3.2. Ablation Study
3.2.1. Varying the number of spatial attention models
- interestingly, treating a person as a single region performs better than splitting into two distinct body parts